1. INTRODUCTION

Class imbalance occurs when one class has significantly fewer instances than the other classes. It is
quantified by the imbalance ratio (IR), which is the ratio of the number of instances in the class with
the most instances (majority class) to the number in the class with the fewest instances (minority class)
[1]. Machine learning algorithms generally work best when the classes have similar numbers of instances
[2]. Class imbalance is one of the causes of low classification accuracy, and it also means that important
information contained in the minority class may not be captured because the classifier covers the majority
class better [3]. Multi-class imbalance is harder to handle than the two-class problem, especially with
regard to accuracy and to training on large datasets with high imbalance ratios [4]. Another issue that
often escapes attention is the overlapping problem, where several classes overlap with other classes.
Overlapping has a far greater impact on accuracy than class imbalance [5]. Under overlapping, the accuracy
of one class can increase at the expense of the accuracy of another class. For example, even when the
overlapping regions have a high concentration of minority-class instances, the classification results can
still show low accuracy because some instances associated with the majority classes are eliminated [6].

The overlapping problem in multi-class data can be addressed with feature selection, which is very
effective against overlapping [7] because it eliminates uninformative predictors and reduces the
dimensionality of the feature space [8]. In the presence of noise, however, the performance of feature
selection can decrease [9], and noise in general affects classification performance [10]. Noise is usually
handled by resampling, but resampling often runs into difficulties when classes overlap [6]. Handling
multi-class imbalance therefore means dealing with overlapping and noise at the same time. A number of
studies have discussed handling class imbalance accompanied by overlapping or noise. Koziarski et al. [11]
proposed the multi-class combined cleaning and resampling (MC-CCR) method, which can overcome the noise
problem but has difficulty dealing with overlapping. Feature selection, on the other hand, has been used by
a number of researchers to deal with overlapping: [12] proposed density-based feature selection, and [13]
proposed the rough-set-based feature selection algorithm for imbalanced data (RSFSAID).

The ensemble learning approach, especially hybrid ensembles, is very commonly used to overcome
multi-class imbalance problems [14]. Hybrid ensembles handle multi-class imbalance accompanied by
overlapping well, but care is needed when the imbalance ratio is high and the data also contain noise and
overlapping [15]. Xie et al. [16] identified several factors that need to be considered to improve the
performance of hybrid ensembles, such as choosing the right selection method for noisy labels and the right
sampling method at the processing stage. The studies in [17] and [18] show that using an appropriate
feature selection method at the preprocessing stage of a hybrid ensemble gives good results in handling
class imbalance and overlapping. Noise in hybrid ensembles can be handled by choosing the right sampling
method at the processing stage [19], [20].

A hybrid ensemble approach that combines feature selection at the preprocessing stage with sampling
at the processing stage is the hybrid approach redefinition for multi-class imbalance (HAR-MI) [21]. In
this study, HAR-MI is combined with a resampling algorithm at the processing stage and with feature
selection at the preprocessing stage. Selection under no sampling [22], [23] and selection under the
synthetic minority over-sampling technique (SMOTE) [24] are two feature selection approaches that can be
used to overcome overlapping. According to [5], the minimizing overlapping selection under no-sampling
(MOSNS) and minimizing overlapping selection under SMOTE (MOSS) methods both give very satisfactory results
in dealing with overlapping, with a similar level of efficiency.

A number of studies have handled noise by resampling at the processing stage. The study in [25] uses
SMOTE sampling to handle noise, but it has problems with overlapping classes and with accuracy. The same
problems are found when SMOTE oversampling is combined with edited nearest neighbors (ENN) [26]. MC-CCR is
one sampling method that produces excellent results.

The HAR-MI method handles multi-class imbalance well, but its results become worse when the classes
overlap and the data contain noise. HAR-MI employs the random balance ensemble method at the preprocessing
stage. In this study, that stage is paired with the MOSS method, and the MC-CCR method is used at the
processing stage. The findings are compared with those obtained using neighbourhood-based undersampling,
which is one of the best techniques for dealing with class imbalance and overlapping in multi-class
imbalanced conditions [18]. Augmented R-value, class average accuracy, class balance accuracy, multi-class
G-mean, and confusion entropy are used to compare the results.


2. RELATED WORKS 
2.1. Augmented R-value for multi-class 

The R-value of each class shows how much of that class lies in an overlapping area. According to [27],
the R-value corresponds strongly with classifier performance. For the multi-class case, [23] suggested the
augmented R-value, which is calculated as in (1).

$$R_{aug}(D[V]) = \frac{\sum_{i=0}^{k-1} |C_{k-1-i}|\, R(C_i)}{\sum_{i=0}^{k-1} |C_i|} \tag{1}$$

where $C_0, C_1, \ldots, C_{k-1}$ are the $k$ class labels with $|C_0| \ge |C_1| \ge \cdots \ge |C_{k-1}|$,
and $D[V]$ is the dataset $D$ restricted to the predictors in set $V$. A higher $R_{aug}$ indicates a higher
degree of overlap.
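
The weighting in (1) can be illustrated with a short sketch. It assumes the per-class R-values have already been computed elsewhere (the text does not define how), and it reads (1) as weighting the R-value of the i-th largest class by the size of the i-th smallest class. The function name and arguments are illustrative only.

```python
def augmented_r_value(class_sizes, r_values):
    """Augmented R-value from per-class sizes and per-class R-values.

    Assumption: r_values[i] is the already-computed R-value of the class
    whose size is class_sizes[i].
    """
    if len(class_sizes) != len(r_values):
        raise ValueError("class_sizes and r_values must have the same length")
    # Sort classes by size, largest first, so that |C_0| >= ... >= |C_{k-1}|.
    order = sorted(range(len(class_sizes)),
                   key=lambda i: class_sizes[i], reverse=True)
    sizes = [class_sizes[i] for i in order]
    r = [r_values[i] for i in order]
    k = len(sizes)
    # The i-th largest class is weighted by the size of the i-th smallest class.
    numerator = sum(sizes[k - 1 - i] * r[i] for i in range(k))
    return numerator / sum(sizes)


# Hypothetical R-values for a three-class dataset with sizes 150/35/30.
example = augmented_r_value([150, 35, 30], [0.1, 0.4, 0.5])
```

Under this reading, the overlap of the smallest classes contributes most to the score, because their R-values are multiplied by the sizes of the largest classes.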


2.2. Confusion matrix 

In general, classification results can be grouped into four groups: true positive (TP), true negative
(TN), false positive (FP), and false negative (FN). They can be presented in a confusion matrix, as shown
in Table 1 [28].


Table 1. Confusion matrix for a classification problem
                        Predicted (classified) as
Actually (really is)    Positive    Negative
Positive samples        TP          FN
Negative samples        FP          TN


2.3. Classifier performance 

The following measures are used to determine the classifier performance.
- Class average accuracy for $C$ classes is calculated using (2) [29].

$$AvAcc = \frac{1}{C}\sum_{i=1}^{C}\frac{tp_i + tn_i}{tp_i + tn_i + fp_i + fn_i} \tag{2}$$

where $C$ is the number of classes and $tp_i$, $tn_i$, $fp_i$, and $fn_i$ are the TP, TN, FP, and FN counts
of class $i$ obtained from the confusion matrix.
- Class balance accuracy. For any $k \times k$ confusion matrix, class balance accuracy is defined as in
  (3) [30].

$$CBA = \frac{1}{k}\sum_{i=1}^{k}\frac{c_{ii}}{\max(c_{i\cdot},\, c_{\cdot i})} \tag{3}$$

where $c_{ii}$ is a diagonal cell, $c_{i\cdot}$ is a row total, and $c_{\cdot i}$ is a column total of the
confusion matrix, as shown for three classes in Table 2.


Table 2. Confusion matrix for class balance accuracy
                      Predicted
Actual      Class 1    Class 2    Class 3    Total
Class 1     c11        c12        c13        c1.
Class 2     c21        c22        c23        c2.
Class 3     c31        c32        c33        c3.
Total       c.1        c.2        c.3        N


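- Multi-class G-mean (mGM). The G-mean was proposed as the geometric mean of the recall (R) values of two
  classes [31]. To apply this measure to multi-class situations, Sun et al. [32] define the G-mean as the
  geometric mean of each class's recall values, as in (4) and (5), where $c_{ij}$ is the number of class
  $i$ instances predicted as class $j$.

$$R_i = \frac{c_{ii}}{\sum_{j=1}^{k} c_{ij}} \tag{4}$$

$$mGM = \left(\prod_{i=1}^{k} R_i\right)^{1/k} \tag{5}$$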


- Confusion entropy (CEN). Wei et al. [33] proposed using the confusion entropy to measure classifier
  performance. In the confusion entropy, the misclassification information includes both how the samples
  with true class label $cl_j$ were misclassified to the other $N$ classes and how the samples from the
  other $N$ classes were misclassified to class $cl_j$.

$$CEN = \sum_{j} P_j\, CEN_j \tag{6}$$

where

$$P_j = \frac{\sum_{k}\left(C_{j,k} + C_{k,j}\right)}{2\sum_{k,l} C_{k,l}} \tag{7}$$

$$CEN_j = -\sum_{k,\, k \ne j}\left(P^{j}_{j,k}\log_{2N} P^{j}_{j,k} + P^{j}_{k,j}\log_{2N} P^{j}_{k,j}\right) \tag{8}$$
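
As a rough illustration of the measures listed above, the following sketch computes class average accuracy, class balance accuracy, and multi-class G-mean from a square confusion matrix whose rows are actual classes and whose columns are predicted classes. The function names and the example matrix are hypothetical, confusion entropy is left out for brevity, and the sketch assumes every class has at least one actual instance.

```python
import math


def class_average_accuracy(cm):
    """Mean over classes of (tp + tn) / (tp + tn + fp + fn), one-vs-rest."""
    k = len(cm)
    n = sum(sum(row) for row in cm)
    total = 0.0
    for i in range(k):
        tp = cm[i][i]
        fn = sum(cm[i]) - tp                         # class i, predicted as another class
        fp = sum(cm[r][i] for r in range(k)) - tp    # other classes, predicted as class i
        tn = n - tp - fn - fp
        total += (tp + tn) / (tp + tn + fp + fn)
    return total / k


def class_balance_accuracy(cm):
    """Mean over classes of c_ii / max(row total, column total)."""
    k = len(cm)
    total = 0.0
    for i in range(k):
        row_total = sum(cm[i])
        col_total = sum(cm[r][i] for r in range(k))
        total += cm[i][i] / max(row_total, col_total)
    return total / k


def multiclass_g_mean(cm):
    """Geometric mean of the per-class recall values."""
    recalls = [cm[i][i] / sum(cm[i]) for i in range(len(cm))]
    return math.prod(recalls) ** (1 / len(cm))


# Hypothetical three-class confusion matrix (rows: actual, columns: predicted).
example_cm = [[8, 1, 1],
              [2, 6, 2],
              [0, 1, 9]]
scores = (class_average_accuracy(example_cm),
          class_balance_accuracy(example_cm),
          multiclass_g_mean(example_cm))
```

Because the G-mean is a product of recalls, a single class with recall 0 drives it to 0, which is why it is sensitive to poorly recognized minority classes.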


2.4. Minimizing overlapping selection under SMOTE 
The MOSS algorithm is as follows [5].
1: Input: X-matrix with p predictors, X = [x_1, x_2, ..., x_p]; class label y
2: Over-sample the positive samples using SMOTE; merge the generated instances with the original ones to
   get the updated X-matrix X_new and the updated class label y_new
3: X ← X_new; y ← y_new
4: Establish the sparse regularization path β(λ, α) according to (9)
5: Compute the optimal (β̂_1, β̂_2, ..., β̂_p)^T via (10)
6: Select the features with β̂_j ≠ 0, for j = 1, 2, ..., p


Equation (9) shows the sparse penalty that is used to create the sparse regularization [34].

$$P_{\alpha}(\beta) = \frac{1}{2}(1 - \alpha)\,\|\beta\|_2^2 + \alpha\,\|\beta\|_1 \tag{9}$$


Equation (10) shows the loss that is used to determine the optimal $\hat{\beta}_j$.

$$Loss = -\frac{1}{n}\sum_{i=1}^{n}\left(y_i\,\beta^{T}x_i - \ln\left(1 + e^{\beta^{T}x_i}\right)\right) \tag{10}$$


Overlap handling starts at the preprocessing stage by adding MOSS, as seen in the preceding
pseudocode. The MOSS process begins with the p predictors and the class labels y. The minority class is
over-sampled using SMOTE, the sparse regularization is then carried out using the sparse penalty in (9),
and the loss is calculated using (10).
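
As a minimal sketch, the two quantities in (9) and (10) can be evaluated directly for a given coefficient vector, and the selection rule in step 6 reduces to keeping the nonzero coefficients. The function names, the 0/1 coding of y, and the omission of an intercept are assumptions made for illustration. The sketch does not run SMOTE or fit the regularization path.

```python
import math


def sparse_penalty(beta, alpha):
    """Penalty of (9): 0.5 * (1 - alpha) * ||beta||_2^2 + alpha * ||beta||_1."""
    l2_squared = sum(b * b for b in beta)
    l1 = sum(abs(b) for b in beta)
    return 0.5 * (1 - alpha) * l2_squared + alpha * l1


def logistic_loss(beta, X, y):
    """Loss of (10): negative mean log-likelihood of a logistic model.

    Assumptions: y holds 0/1 labels and there is no intercept term.
    """
    total = 0.0
    for x_i, y_i in zip(X, y):
        z = sum(b * v for b, v in zip(beta, x_i))   # beta^T x_i
        total += y_i * z - math.log1p(math.exp(z))
    return -total / len(X)


def selected_features(beta, tol=0.0):
    """Indices j whose coefficient is nonzero (step 6 of the pseudocode)."""
    return [j for j, b in enumerate(beta) if abs(b) > tol]


# Hypothetical coefficients: the third predictor has been shrunk to zero.
beta_hat = [1.0, -2.0, 0.0]
kept = selected_features(beta_hat)
penalty = sparse_penalty(beta_hat, alpha=0.5)
```

With alpha = 1 the penalty is a pure L1 (lasso) term, which is what sets some coefficients exactly to zero and makes the selection in step 6 possible.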


2.5. Multi-class combined cleaning and resampling 
The MC-CCR algorithm is as follows [11].
Input: collection of observations X, with X^(c) denoting the subcollection of observations belonging to class c
Parameters: energy budget for the expansion of each sphere; p-norm used to calculate distance
Output: collection of translated and oversampled observations X
1: function MC-CCR(X, energy, p)
2: C ← collection of all classes, sorted by the number of associated observations in descending order
3: for i ← 1 to |C| do
4:   n_classes ← number of classes with a higher number of observations than C_i
5:   if n_classes > 0 then
6:     X_min ← X^(C_i)
7:     X_maj ← ∅
8:     for j ← 1 to n_classes do
9:       add ⌊|X^(C_1)| / n_classes⌋ observations randomly selected from X^(C_j) to X_maj
10:    end for
11:    X'_maj, S ← CCR(X_maj, X_min, energy, p)
12:    X^(C_i) ← X^(C_i) ∪ S
13:    substitute the observations used to construct X_maj with X'_maj
14:  end if
15: end for
16: return X
The MC-CCR method is used to eliminate noise at the processing stage. It should be noted that its
impact is limited and falls mainly on the combined majority observations.
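
The construction of the combined majority collection (steps 2 to 10 of the listing above) can be sketched as follows. This is only an illustration under stated assumptions: the per-class quota is taken to be the size of the largest class divided by the number of larger classes, the quota is capped at the size of each class (a choice made here so that sampling without replacement cannot fail), and the CCR cleaning and oversampling step itself is not implemented. The function and variable names are hypothetical.

```python
import random


def combined_majority(X_by_class, minority_label, rng):
    """Sample an equal share from every class larger than the current one.

    X_by_class maps a class label to the list of its observations.
    Returns the combined majority collection X_maj (empty if no class is larger).
    """
    # Classes sorted by number of observations, largest first (step 2).
    classes = sorted(X_by_class, key=lambda c: len(X_by_class[c]), reverse=True)
    minority_size = len(X_by_class[minority_label])
    larger = [c for c in classes if len(X_by_class[c]) > minority_size]
    if not larger:
        return []
    # Assumed quota: size of the largest class divided by n_classes (step 9).
    quota = len(X_by_class[classes[0]]) // len(larger)
    x_maj = []
    for c in larger:
        pool = X_by_class[c]
        # Cap at the pool size; this cap is an assumption of the sketch.
        x_maj.extend(rng.sample(pool, min(quota, len(pool))))
    return x_maj


# Hypothetical data: observations are (label, index) pairs.
data = {
    "a": [("a", i) for i in range(100)],
    "b": [("b", i) for i in range(40)],
    "c": [("c", i) for i in range(10)],
}
x_maj_for_c = combined_majority(data, "c", random.Random(0))
```

In the full method, each class in turn plays the role of the minority, and the returned collection would be passed together with that class to the binary CCR step.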


2.6. Hybrid approach redefinition for multi-class imbalance 
The algorithm of the hybrid approach redefinition for multi-class imbalance is as follows [21].
Require: Set S of examples (x_i, y_i)
Ensure: New set S' of examples with Random Balance Ensemble Method
1: totalSize ← |S|
2: Determine k using Dynamic Ensemble Selection
3: Build the candidate ensemble for Safe, Borderline, Rare, and Outlier
4: For all samples in Majority and Minority
5:   Preprocessing Stage using Random Balance Ensemble Method
6:   New Majority and New Minority of Preprocessing Dataset
7: End
8: Determine the Augmented R-Value
9: For all instances in Preprocessing Dataset
10:  Determine Majority and Minority Class
11:  For All Instances in Majority Class
12:    Biased Support Vector Machine to Determine SV Sets and NSV Sets
13:  End
14:  For All Instances in Minority Class
15:    Biased Support Vector Machine to Determine SV Sets and NSV Sets
16:  End
17: End
18: For All Instances in NSV Sets from Majority Class
19:   Process Multiple Random Under Sampling
20: End
21: For All Instances in SV Sets from Minority Class
22:   Process SMOTEBoost
23: End
Based on the preceding algorithm, the HAR-MI method is divided into two major stages: preprocessing
and processing. The random balance ensemble method and dynamic ensemble selection are used in the
preprocessing stage, which generates a preprocessing dataset. This dataset then goes through the processing
stage using different contribution sampling. In different contribution sampling, a biased support vector
machine produces support vector (SV) sets and non-support vector (NSV) sets for both the majority and
minority classes. NSV sets from the majority classes are then processed using multiple random under
sampling, while SV sets from the minority classes are processed using SMOTEBoost.


3. RESEARCH METHOD 

Figure 1 shows the stages of this research. The process starts with the preprocessing stage, which
employs MOSS. The sparse selection and lasso penalty values are determined first. The output of this stage
is a preprocessing dataset, which then goes through the processing stage using MC-CCR. The results from
HAR-MI with resampling and feature selection are then compared with the results from neighborhood-based
undersampling.


3.1. Preprocessing stage 
MOSS is used to modify the HAR-MI preprocessing stage for multi-class problems. The following
algorithm depicts the preprocessing stage.
Require: S as set of instances (x_i, y_i)
Ensure: S' is the new set with MOSS
1: totalSize ← |S|
2: Calculate k as the number of Nearest Neighbors
3: For All Instances of S
4:   Build the Positive or Minority Class Borderline as E,C;*
5:   Build the Negative or Majority Class Borderline as Egg
6: End
7: Create a candidate ensemble based on the k value for safe, borderline, rare, and outlier candidates
8: Use (9) to calculate the Sparse Selection
9: Use (10) to calculate the Loss Penalty
10: Determine the Sparse Regularization
11: For All Instances in Positive
12:   MOSS is used to sample all instances
13:   Build newMinority
14:   Build newMajority
15: End
16: Determine the Augmented R-Value
17: return S'


[Figure 1: flowchart. Calculating sparse selection and lasso penalty; preprocessing using MOSS;
preprocessing dataset; processing using MC-CCR; classification result; result using the combination of
HAR-MI with resampling algorithm and feature selection, and result using neighbourhood-based undersampling;
comparison of the two results.]

Figure 1. Stages of research methods


Based on the preceding algorithm, the HAR-MI preprocessing stage is carried out using one of the
feature selection methods, namely MOSS. MOSS is intended to handle overlapping before the data enter the
processing stage. The MOSS stage begins by determining the values of the sparse selection and the loss
penalty. MOSS is then used to sample each instance in the minority class. The result is a preprocessing
dataset, whose overlap is then measured by the augmented R-value.


3.2. Processing stage 


The following algorithm depicts the processing steps.
1: For all samples in preprocessed dataset
2: Preprocessed Dataset should be added to S
3: Using B-SVM, determine the SV and NSV sets for the majority and minority
4: For All Samples in Negative
5: Check and remove the Noise in SV Sets and NSV Sets using MC-CCR
6: End For
7: For All Samples in Positive
8: Check and remove the Noise in SV Sets and NSV Sets using MC-CCR





Hybrid approach redefinition-multi class with resampling and feature selection for... (Erianto Ongko) 


1724 O ISSN: 2302-9285 


9: SMOTEBoost Step for SV Sets and Produce SMOTESets 
10: End For 
11: For All SV Sets and NSV Sets from Majority Class do 
12: New NegativeSampleSets 
13: End For 
14: For All SV Sets and NSV Sets from Minority Class do 
15: New PositiveSampleSets 
16: End For 
17:End For 

According to the preceding algorithm, the processing stage begins with the biased support vector
machine (B-SVM), which determines the SV sets and NSV sets for the majority and minority classes. Each SV
set and NSV set in the majority class then goes through noise removal and resampling using MC-CCR. The same
is done for the SV sets and NSV sets of the minority classes. In addition, SMOTEBoost is applied to the SV
sets of the minority class. The whole process produces the result dataset.


4. RESULTS AND ANALYSIS 
4.1. Dataset description 

We conducted our experiments using six multi-class imbalanced datasets from the knowledge extraction
based on evolutionary learning (KEEL) repository, with low, moderate, or high imbalance ratios (IR). The
datasets with low IR are new-thyroid and balance, the datasets with moderate IR are flare and car, and the
datasets with high IR are red wine quality and yeast. Table 3 contains a description of the datasets [35].


Table 3. Description of dataset [35]
Dataset             #Ex     #Atts   Distribution of class               IR
New-Thyroid         215     5       150/35/30                           5
Balance             625     4       288/49/288                          5.88
Flare               1066    11      331/239/211/147/95/43               7.7
Car                 1728    6       65/69/384/1210                      18.61
Red Wine Quality    1599    11      10/53/681/638/199/18                68.1
Yeast               1484    8       463/5/35/44/51/163/244/429/20/30    92.6
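
The IR column of Table 3 is consistent with dividing the size of the largest class by the size of the smallest class. A small sketch of that check is given below. The helper names are illustrative, and the rounding used in the table is not stated, so the comparison is only approximate.

```python
def parse_distribution(text):
    """Turn a class distribution such as '150/35/30' into a list of counts."""
    return [int(part) for part in text.split("/")]


def imbalance_ratio(class_counts):
    """Size of the largest class divided by the size of the smallest class."""
    return max(class_counts) / min(class_counts)


# Class distributions copied from Table 3.
distributions = {
    "New-Thyroid": "150/35/30",
    "Balance": "288/49/288",
    "Flare": "331/239/211/147/95/43",
    "Car": "65/69/384/1210",
    "Red Wine Quality": "10/53/681/638/199/18",
    "Yeast": "463/5/35/44/51/163/244/429/20/30",
}
ratios = {name: imbalance_ratio(parse_distribution(dist))
          for name, dist in distributions.items()}
```

The same parsing also makes it possible to confirm that each distribution sums to the number of examples listed in the #Ex column.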





After the datasets are selected, the next step is to introduce noise. Noise is generated by taking a
subset of the training examples and randomly replacing their labels. This experiment uses a noise level of
0.1.
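
One way to generate this kind of label noise is sketched below. The text does not say how the replacement label is chosen, so the sketch assumes it is drawn uniformly from the other classes, and that the noise level is the fraction of training examples whose labels are replaced. The function name is hypothetical.

```python
import random


def inject_label_noise(labels, noise_level, rng):
    """Return a copy of labels in which a fraction noise_level is replaced.

    Assumptions: at least two classes are present, and each selected label
    is replaced by a different class chosen uniformly at random.
    """
    noisy = list(labels)
    classes = sorted(set(noisy))
    n_noisy = round(noise_level * len(noisy))
    # Pick the examples to corrupt without replacement.
    for idx in rng.sample(range(len(noisy)), n_noisy):
        alternatives = [c for c in classes if c != noisy[idx]]
        noisy[idx] = rng.choice(alternatives)
    return noisy


# Hypothetical training labels: 100 examples in three classes.
train_labels = [0] * 50 + [1] * 30 + [2] * 20
noisy_labels = inject_label_noise(train_labels, 0.1, random.Random(0))
```

Passing an explicit random generator keeps the corrupted subset reproducible across runs, which matters when two methods are compared on the same noisy data.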


4.2. Testing result 

The first test compares the augmented R-value and class average accuracy obtained by HAR-MI with
resampling algorithm and feature selection and by neighborhood-based undersampling. Table 4 shows the test
results. According to Table 4, the two methods do not differ much in terms of overlapping, as indicated by
their similar augmented R-values. The lower the augmented R-value, the lower the overlapping level. There
is a strong relationship between overlapping and accuracy: the lower the overlapping, the better the class
average accuracy obtained. Neighborhood-based undersampling has a slightly lower augmented R-value on the
datasets with a large number of attributes, flare and red wine quality, while HAR-MI with resampling
algorithm and feature selection has the lower augmented R-value on the other four datasets. The imbalance
ratio also has an impact on the results: the higher the imbalance ratio, the more overlapping there is, and
the lower the accuracy obtained.

The second test compares the class balance accuracy, multi-class G-mean, and confusion entropy
obtained by HAR-MI with resampling algorithm and feature selection and by neighborhood-based undersampling.
Table 5 shows the test results. Based on Table 5, the number of attributes, the number of classes, and the
level of IR all have a clear impact on class balance accuracy. For datasets with similar imbalance ratios,
the number of attributes and classes largely determines the class balance accuracy. This can be seen in the
Balance dataset, which has a higher class balance accuracy than the New-Thyroid dataset. In terms of class
balance accuracy, HAR-MI with resampling algorithm and feature selection produces better results than
neighborhood-based undersampling on every dataset.

The multi-class G-mean results show that, for both HAR-MI with resampling algorithm and feature
selection and neighborhood-based undersampling, a higher IR generally gives a lower multi-class G-mean,
because the G-mean reflects the balance between positive samples and negative samples. The confusion
entropy results depend on the number of classes and the imbalance ratio. For imbalance ratios that are not
very different, such as those of the flare and car datasets, the number of classes determines the result.
In general, the confusion entropy values obtained by the two methods are not very different.


Table 4. Augmented R-value and class average accuracy testing results
                    HAR-MI with resampling algorithm          Neighbourhood-based
                    and feature selection                     undersampling
Dataset             Augmented      Class average              Augmented      Class average
                    R-value        accuracy                   R-value        accuracy
New-Thyroid         0.335          0.972                      0.341          0.933
Balance             0.327          0.865                      0.345          0.891
Flare               0.367          0.713                      0.359          0.721
Car                 0.361          0.725                      0.373          0.693
Red Wine Quality    0.428          0.687                      0.415          0.674
Yeast               0.448          0.623                      0.452          0.615





Table 5. Class balance accuracy, multi class G-mean, and confusion entropy testing results for each method
                    HAR-MI with resampling algorithm            Neighbourhood-based
                    and feature selection                       undersampling
Dataset             Class balance  Multi class  Confusion       Class balance  Multi class  Confusion
                    accuracy       G-mean       entropy         accuracy       G-mean       entropy
New-Thyroid         0.913          0.898        0.091           0.897          0.875        0.11
Balance             0.927          0.915        0.105           0.911          0.897        0.12
Flare               0.691          0.715        0.292           0.676          0.687        0.282
Car                 0.897          0.875        0.311           0.874          0.815        0.327
Red Wine Quality    0.516          0.467        0.472           0.498          0.465        0.493
Yeast               0.495          0.512        0.526           0.476          0.498        0.523





4.3. Statistical tests 

The statistical test uses the Wilcoxon signed-rank test, a statistical procedure for assessing
performance on the basis of pairwise comparisons [36]. The results of the statistical tests are shown in
Table 6.


Table 6. Wilcoxon signed-rank test in order to assess performance
Performance measurement    P-value      Hypothesis
Augmented R-value          0.687500     H0 is accepted and H1 is rejected because the p-value is greater than 0.05
Class average accuracy     0.344118     H0 is accepted and H1 is rejected because the p-value is greater than 0.05
Class balance accuracy     0.03552      H0 is rejected and H1 is accepted because the p-value is less than 0.05
Multi class G-mean         0.0312500    H0 is rejected and H1 is accepted because the p-value is less than 0.05
Confusion entropy          0.156250     H0 is accepted and H1 is rejected because the p-value is greater than 0.05
H0: no significant difference in score between HAR-MI with resampling algorithm and feature selection and
neighbourhood-based undersampling.
H1: significant difference in score between HAR-MI with resampling algorithm and feature selection and
neighbourhood-based undersampling.
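
For six paired scores, the Wilcoxon signed-rank test can be carried out exactly by enumerating all sign patterns. The sketch below is a plain-Python illustration of that procedure, not the software used in this study. It drops zero differences, gives tied absolute differences their average rank, and reports a two-sided p-value. The function name is hypothetical, and the input values are the multi class G-mean columns of Table 5.

```python
from itertools import product


def wilcoxon_signed_rank_exact(a, b):
    """Exact two-sided Wilcoxon signed-rank test for small paired samples.

    Returns (w_plus, p_value), where w_plus is the sum of ranks of the
    positive differences a[i] - b[i].
    """
    diffs = [x - y for x, y in zip(a, b) if x != y]   # zero differences are dropped
    n = len(diffs)
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Find the run of tied absolute differences (exact float equality).
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = average_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    # Under H0 every sign pattern is equally likely: enumerate all 2^n of them.
    lower = upper = 0
    for signs in product((0, 1), repeat=n):
        w = sum(r for r, s in zip(ranks, signs) if s)
        lower += w <= w_plus
        upper += w >= w_plus
    p_value = min(1.0, 2 * min(lower, upper) / 2 ** n)
    return w_plus, p_value


# Multi class G-mean per dataset, taken from Table 5.
har_mi = [0.898, 0.915, 0.715, 0.875, 0.467, 0.512]
neighbourhood = [0.875, 0.897, 0.687, 0.815, 0.465, 0.498]
w_plus, p_value = wilcoxon_signed_rank_exact(har_mi, neighbourhood)
```

Because the enumeration grows as 2^n, this approach is only practical for a small number of datasets. For larger samples a normal approximation is normally used instead.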





4.4. Discussion 
According to the results in Tables 4-6, there is no significant difference in augmented R-value, class
average accuracy, and confusion entropy between HAR-MI with resampling algorithms and feature selection
and neighborhood-based undersampling. This indicates that both methods overcome overlapping with good
outcomes. The confusion entropy obtained is good, which means that the misclassifications are spread evenly
over all classes. The test results for class balance accuracy and multi-class G-mean show that HAR-MI with
resampling algorithms and feature selection gives better results than neighborhood-based undersampling.

Overlapping and accuracy are interrelated: the higher the overlapping, the lower the accuracy. The
imbalance ratio is the main factor that determines how much overlap there is: the higher the imbalance
ratio, the higher the overlapping. In terms of overlapping, HAR-MI with resampling algorithms and feature
selection has some limitations on datasets with a large number of attributes, in addition to those caused
by the imbalance ratio. The results show that, in addition to IR, the number of attributes and the number
of classes determine the class balance accuracy. The number of classes and the imbalance ratio have a
strong influence on multi-class G-mean and confusion entropy.


5. CONCLUSION 

The test results show that, for handling multi-class imbalance accompanied by overlapping and noise,
both HAR-MI with resampling algorithms and feature selection and neighborhood-based undersampling give good
results. The results obtained by HAR-MI with resampling algorithms and feature selection are generally
better than those of neighborhood-based undersampling. This is indicated by generally better augmented
R-value, class average accuracy, class balance accuracy, multi-class G-mean, and confusion entropy,
although the differences in augmented R-value, class average accuracy, and confusion entropy are not
statistically significant. Both methods have limitations in handling overlapping, with a slight decrease in
performance on datasets with a large number of attributes. The imbalance ratio also has a direct
relationship with classifier performance. Future research is expected to address the decrease in
performance on datasets with a large number of attributes and a high IR.